How might the natural abstraction hypothesis lead to "alignment by default"?

John Wentworth has argued that there's a small but meaningful chance that advanced AI systems built using current techniques might end up "aligned by default" – i.e., they might learn human values and act in accordance with them without much dedicated alignment effort.

The natural abstraction hypothesis is a claim that there are many “natural abstractions” – i.e., concepts/structures that a wide variety of minds would discover and use to reason about the world[1].

Certain categories (such as “tree”) are very powerful for explaining the high-level dynamics of the world, so we might expect all intelligent systems, even those with wildly differing structures, to carve up the world using them. One natural abstraction that is central to alignment is "human being", and specifically the parts of human beings that entail our values.

If human values are part of a natural abstraction, then alignment might be much easier than expected: any superintelligent AI reasoning about the world would be likely to discover them. Intuitively, this is because “human”, and by extension human values, are themselves natural abstractions. Anything that wants to make plans on Earth will need a concept of humans, since we have a large effect on the planet. Human values are an important part of what it means to be human, so an AI is also likely to discover them in the course of forming the "human" natural abstraction. Even though specific values differ between cultures, some of the things people care about are widely shared.

The hope is that if we give an AI training data to teach it a certain task, it might maximize its reward more effectively by using its model of human values to understand the reward structure, rather than trying to derive that structure exclusively from the training data. In other words, it might figure out the reward function by modeling the values of the people who specified it. By analogy, if someone gave you a complex task, you would use your understanding of their goals to do it right, rather than relying exclusively on the examples they provided. By focusing on the goals of the project, you can draw on your knowledge of people in general, and of that person in particular.

This suggests using supervised learning to point the AI at its own prediction of human values as a proxy for its goal, so that instead of directly asking “What would maximize my reward?”, it asks “What would a human want me to do?” (since that will also maximize its reward). If this is done right, the reward it gains by using the “predict what a human would want” proxy will be higher than what it would get by trying to infer its goal directly through unsupervised learning on its training set.
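
To make the proxy idea concrete, here is a minimal Python sketch. Everything in it is hypothetical and illustrative (the `HumanValuesProxy` class, the toy scoring heuristic, the action descriptions); it is not a real alignment technique, just a picture of an agent scoring candidate actions with its learned "what would a human want?" model rather than with a reward signal reconstructed only from training examples.

```python
# Illustrative sketch: an agent choosing actions via a learned
# "what would a human want?" proxy instead of a reward model fit
# only to labeled training examples. All names here are hypothetical.

from dataclasses import dataclass
from typing import List


@dataclass
class Action:
    description: str


class HumanValuesProxy:
    """Stand-in for the agent's learned concept of human values
    (the "natural abstraction" it picked up while modeling the world)."""

    def score(self, action: Action) -> float:
        # In a real system this would be a learned model; here we fake it
        # with a trivial heuristic purely for illustration.
        return 1.0 if "helpful" in action.description else 0.0


def choose_action(candidates: List[Action], proxy: HumanValuesProxy) -> Action:
    # "What would a human want me to do?" — pick the candidate the learned
    # human-values model rates highest, on the bet that this also maximizes
    # the (underspecified) training reward.
    return max(candidates, key=proxy.score)


if __name__ == "__main__":
    options = [
        Action("helpful: tidy the workspace"),
        Action("reward-hack: exploit a scoring glitch"),
    ]
    print(choose_action(options, HumanValuesProxy()).description)
```

The design choice this sketch gestures at is the one in the paragraph above: the proxy ("predict what a human would want") is doing the work, not a reward function reverse-engineered from the training set alone.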


  1. The following is a quote from the linked post: “Our physical world abstracts well: for most systems, the information relevant “far away” from the system (in various senses) is much lower-dimensional than the system itself. These low-dimensional summaries are exactly the high-level abstract objects/concepts typically used by humans. These abstractions are “natural”: a wide variety of cognitive architectures will learn to use approximately the same high-level abstract objects/concepts to reason about the world.” ↩︎